Calibration Bands for Mean Estimates within the Exponential Dispersion Family

Delong, Łukasz, Gatti, Selim, Wüthrich, Mario V.

arXiv.org Machine Learning

Version of October 8, 2025.

Abstract: A statistical model is said to be calibrated if the resulting mean estimates perfectly match the true means of the underlying responses. Aiming for calibration is often not achievable in practice, as one has to deal with finite samples of noisy observations. A weaker notion of calibration is auto-calibration: an auto-calibrated model satisfies the property that the expected value of the responses, given a mean estimate, matches this estimate. Testing for auto-calibration has only recently been considered in the literature, and we propose a new approach based on calibration bands. Calibration bands denote a set of lower and upper bounds such that the probability that the true means lie simultaneously inside those bounds exceeds a given confidence level. Such bands were constructed by Yang-Barber (2019) for sub-Gaussian distributions. Dimitriadis et al. (2023) then introduced narrower bands for the Bernoulli distribution. We use the same idea to extend the construction to the entire exponential dispersion family, which contains, for example, the binomial, Poisson, negative binomial, gamma and normal distributions. Moreover, we show that the obtained calibration bands allow us to construct various tests for calibration and auto-calibration, respectively. As the construction of the bands does not rely on asymptotic results, we emphasize that our tests can be used for any sample size.

Keywords: auto-calibration, calibration, calibration bands, exponential dispersion family, mean estimation, regression modeling, binomial distribution, Poisson distribution, negative binomial distribution, gamma distribution, normal distribution, inverse Gaussian distribution.
1 Introduction

Various statistical methods can be used to derive mean estimates from available observations, and it is important to understand whether these mean estimates are reliable for decision making. A statistical model is said to be calibrated if the resulting mean estimates perfectly match the true means of the underlying responses. In practice, calibration is often not achievable, as estimates are obtained from finite samples of noisy observations.
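The notion of auto-calibration described above can be illustrated with a simple binned diagnostic: group observations by similar mean estimates and compare the average response within each group to the average estimate. This is only a heuristic sketch of the concept, not the calibration-band construction of the paper; the function name and bin count are illustrative choices.

```python
import numpy as np

def auto_calibration_gaps(mean_estimates, responses, n_bins=10):
    """Heuristic auto-calibration check: within groups of similar mean
    estimates, the average response should match the average estimate.
    Returns the per-bin gap (observed mean - estimated mean)."""
    order = np.argsort(mean_estimates)
    est = np.asarray(mean_estimates, dtype=float)[order]
    obs = np.asarray(responses, dtype=float)[order]
    gaps = [chunk_obs.mean() - chunk_est.mean()
            for chunk_est, chunk_obs in zip(np.array_split(est, n_bins),
                                            np.array_split(obs, n_bins))]
    return np.array(gaps)

# A perfectly auto-calibrated model has gaps close to zero.
rng = np.random.default_rng(0)
mu = rng.uniform(0.5, 2.0, size=1000)   # true means
y = rng.poisson(mu)                     # Poisson responses with means mu
print(np.abs(auto_calibration_gaps(mu, y)).max())
```

Because the responses above are drawn with the true means as estimates, all bin gaps are small; a miscalibrated model would show systematic nonzero gaps.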


Adaptive Out-of-Control Point Pattern Detection in Sequential Random Finite Set Observations

Bourazas, Konstantinos, Papaioannou, Savvas, Kolios, Panayiotis

arXiv.org Artificial Intelligence

In this work we introduce a novel adaptive anomaly detection framework specifically designed for monitoring sequential random finite set (RFS) observations. Our approach effectively distinguishes between In-Control (normal) and Out-Of-Control (anomalous) data by detecting deviations from the expected statistical behavior of the process. The primary contributions of this study include the development of an innovative RFS-based framework that not only learns the normal behavior of the data-generating process online but also dynamically adapts to behavioral shifts in order to accurately identify abnormal point patterns. To achieve this, we introduce a new class of RFS-based posterior distributions, named Power Discounting Posteriors (PD), which facilitate adaptation to systematic changes in the data while enabling anomaly detection of point pattern data through a novel predictive posterior density function. The effectiveness of the proposed approach is demonstrated by extensive qualitative and quantitative simulation experiments.
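The adaptation-by-discounting idea can be conveyed with a generic scalar sketch: older observations are down-weighted geometrically, so the running estimate tracks behavioral shifts. This is not the paper's RFS Power Discounting Posterior; the `discounted_rate` helper and the discount factor `alpha` are hypothetical stand-ins.

```python
def discounted_rate(counts, alpha=0.9):
    """Discount-weighted estimate of the mean number of points per
    observation: each step multiplies past evidence by alpha, so the
    effective memory is roughly 1 / (1 - alpha) observations."""
    num = den = 0.0
    for c in counts:
        num = alpha * num + c    # discounted sum of counts
        den = alpha * den + 1.0  # discounted number of observations
    return num / den

stable = [5] * 50                 # in-control process
shifted = [5] * 50 + [20] * 10    # process shifts out of control
print(discounted_rate(stable), discounted_rate(shifted))
```

On the stable sequence the estimate stays at the in-control rate, while after the shift it moves quickly toward the new level, which is the behavior a change-adaptive monitor needs.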






RLRF: Competitive Search Agent Design via Reinforcement Learning from Ranker Feedback

Mordo, Tommy, Dekel, Sagie, Madmon, Omer, Tennenholtz, Moshe, Kurland, Oren

arXiv.org Artificial Intelligence

Competitive search is a setting in which document publishers modify their documents to improve their ranking in response to a query. Recently, publishers have increasingly leveraged LLMs to generate and modify competitive content. We introduce Reinforcement Learning from Ranker Feedback (RLRF), a framework that trains LLMs using preference datasets derived from ranking competitions. The goal of an LLM-based publisher agent is to optimize content for improved ranking while accounting for the strategies of competing agents. We generate the datasets using approaches that do not rely on human-authored data. We show that our proposed agents consistently and substantially outperform previously suggested approaches for LLM-based competitive document modification. We further show that our agents are effective with ranking functions they were not trained on (i.e., out of distribution) and that they adapt to strategic opponents. These findings underscore the significant potential of reinforcement learning in competitive search.
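One natural way to derive a preference dataset from a ranking competition is to treat every higher-ranked document as preferred over every lower-ranked one for the same query. The abstract does not spell out RLRF's exact construction, so the `preference_pairs` helper below is purely a hypothetical sketch of this pairwise extraction.

```python
def preference_pairs(query, ranked_docs):
    """Given documents ordered by the ranker (best first), emit
    (query, preferred, rejected) triples for preference-based training."""
    pairs = []
    for i, winner in enumerate(ranked_docs):
        for loser in ranked_docs[i + 1:]:
            pairs.append((query, winner, loser))
    return pairs

# 3 ranked documents yield 3 ordered preference pairs.
pairs = preference_pairs("best laptop 2024", ["doc_a", "doc_b", "doc_c"])
print(len(pairs))
```

Triples of this shape are the standard input format for preference-optimization methods such as DPO or reward-model training in RLHF-style pipelines.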


Export Reviews, Discussions, Author Feedback and Meta-Reviews

Neural Information Processing Systems

First provide a summary of the paper, and then address the following criteria: quality, clarity, originality and significance. The authors propose to formalize the notion that the ranking function depends only on the object features, and not on the order in which the documents are presented. This is a good idea, but the proposed notion of exchangeability is too strict in my opinion: we can capture the intended notion without the strict equality in eqns 1 and 2. We just want the order of the scores to be preserved, not their exact values. In terms of clarity, there are sections that are quite unclear, as pointed out below. The term "symmetric function" should be defined in the proof of Thm 3.2.
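The reviewer's point about preserving score order rather than exact values can be made concrete: the induced ranking should be invariant under any strictly increasing transformation of the scores. A minimal sketch (the `ranking` helper is illustrative, not from the paper under review):

```python
def ranking(scores):
    """Indices of items sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

s = [0.2, 0.9, 0.5]
# A strictly increasing map (cubing positive scores) changes every score
# value but leaves the induced ranking unchanged.
assert ranking(s) == ranking([x ** 3 for x in s])
print(ranking(s))
```

Under this weaker requirement, two ranking functions are equivalent whenever they order every score vector the same way, without demanding the exact score equality of eqns 1 and 2.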


Semantic Bridges Between First Order c-Representations and Cost-Based Semantics: An Initial Perspective

Leisegang, Nicholas, Casini, Giovanni, Meyer, Thomas

arXiv.org Artificial Intelligence

Weighted knowledge bases and cost-based semantics are a recent formalism introduced by Bienvenu et al. for ontology-mediated data querying in the case where a given knowledge base is inconsistent. This is done by adding a weight to each statement in the knowledge base (KB) and then assigning each DL interpretation a cost based on how often it breaks rules in the KB. In this paper we compare this approach with c-representations, a form of non-monotonic reasoning originally introduced by Kern-Isberner. c-Representations describe a means to interpret defeasible concept inclusions in the first-order case, by assigning a numerical ranking to each interpretation via penalties for each violated conditional. We compare these two approaches on a semantic level. In particular, we show that under certain conditions a weighted knowledge base and a set of defeasible conditionals can generate the same ordering on interpretations, and therefore an equivalence of semantic structures up to relative cost. Moreover, we compare the notions of entailment in both cases, where certain notions are equivalently expressible in both formalisms. Our results have the potential to benefit further work on both cost-based semantics and c-representations.
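The cost-based ordering on interpretations can be sketched in propositional miniature: each statement carries a weight, and an interpretation's cost is the total weight of the statements it violates, with lower cost meaning more plausible. The rules and weights below are hypothetical toy choices, not Bienvenu et al.'s description-logic formalism.

```python
from itertools import product

# A toy weighted KB: each entry pairs a check (does this interpretation
# satisfy the statement?) with the cost incurred on violation.
weighted_kb = [
    (lambda w: not (w["bird"] and not w["flies"]), 2.0),    # birds fly
    (lambda w: not (w["penguin"] and w["flies"]), 3.0),     # penguins don't fly
    (lambda w: not (w["penguin"] and not w["bird"]), 5.0),  # penguins are birds
]

def cost(world):
    """Total weight of the violated statements, as in cost-based semantics."""
    return sum(weight for holds, weight in weighted_kb if not holds(world))

# Rank all propositional interpretations by cost (most plausible first).
worlds = [dict(zip(("bird", "flies", "penguin"), bits))
          for bits in product([True, False], repeat=3)]
for w in sorted(worlds, key=cost):
    print(w, cost(w))
```

The ordering induced by `cost` is the semantic object the paper compares against c-representations, where penalties per violated conditional play the role of the weights.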


A-VERT: Agnostic Verification with Embedding Ranking Targets

Aguirre, Nicolás, Caso, Ramiro, Colmeiro, Ramiro Rodríguez, Santelli, Mauro, Calderón, Joaquín Toranzo

arXiv.org Artificial Intelligence

The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and for quality assessment of production model endpoints. Current approaches to response classification rely on methods that are either too expensive (e.g., LLM-as-a-Judge) or too far from real-world conditions (string matching, log-probabilities). In this paper, a structure-free evaluation method is presented. The method uses semantic embedding distances to match target candidates with arbitrary LM-generated text, resulting in a robust classification of the response at a relatively low compute cost (embedding models of fewer than 10B parameters). The results show a regression score of ~0.97 and an accuracy of ~96% against human annotators, tested over 3 data sets and 3 different LM architectures.
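The embedding-distance matching idea can be sketched with a stand-in bag-of-words "embedding": score each answer candidate by its similarity to the free-form LM response and pick the closest. A-VERT itself uses neural sentence-embedding models, so `embed`, `cosine` and `classify` here are illustrative only.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in bag-of-words 'embedding'; a real system would use a
    neural sentence-embedding model here instead."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def classify(response, target_candidates):
    """Match a free-form LM response to the semantically closest
    target candidate, as in embedding-ranking-based verification."""
    return max(target_candidates,
               key=lambda c: cosine(embed(response), embed(c)))

print(classify("the capital of France is Paris", ["Paris", "London", "Berlin"]))
```

Because only the embedding distances matter, the response needs no particular structure, which is what makes the evaluation agnostic to the generating model's output format.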